Conservation, Evolutionary

Mutations occur spontaneously in each generation, randomly changing the amino acid sequences of proteins. Individuals with mutations that impair critical functions of proteins may have resulting problems that make them less able to reproduce. Harmful mutations are lost from the gene pool because the individuals carrying them reproduce less effectively. Over time, only harmless (or very rare beneficial) mutations are maintained in the gene pool. This is evolution.

When the sequences of a given protein are compared between taxa, using multiple sequence alignment (MSA), differences between sequences most often represent mutations that were allowed (by evolution) to persist because they were harmless. Where the sequences are identical, we say that sequence was conserved. Such evolutionary conservation occurs because mutations of these amino acids were harmful to protein function, and were lost over time. Amino acids that are conserved are those most critical to the function of the protein. Thus, looking for evolutionarily conserved patches of amino acids in a 3D protein structure is a good way to locate functional sites.

Proteopedia's evolutionary conservation colors are pre-calculated by ConSurf-DB.

Locating Conserved Patches
Patches of highly conserved amino acid residues on the surface of a protein molecular structure are good candidates for functional sites. Nearly every article in Proteopedia that is titled with a PDB code has an Evolutionary Conservation section below the molecular scene. (Results could not be obtained for a small percentage -- see ConSurfDB Process.) Clicking show in the blue Evolutionary Conservation bar automatically colors all chains in the molecule by evolutionary conservation as calculated by ConSurf-DB.

Briefly, ConSurf-DB gathers sequences similar to that of the protein in question, then constructs a multiple sequence alignment, and analyses it for sequence positions that are conserved (have lower than average differences between sequences) and that are variable (have higher than average differences between sequences). Each amino acid is assigned a conservation score and corresponding color in Proteopedia's interactive 3D molecular scene.
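The idea of scoring each alignment position and mapping the score to a grade can be illustrated with a minimal sketch. It is not ConSurf-DB's actual method (see the overview of its process for that). As a stand-in, the sketch assumes Shannon entropy as the per-column variability measure and equal-width bins for the nine grades; the function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column, ignoring gaps."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def conservation_grades(msa, n_grades=9):
    """Grade each column from 1 (most variable) to n_grades (most conserved).

    Grades are relative to this alignment only: the entropy range observed
    in the MSA is split into n_grades equal-width bins (an assumption made
    for illustration, not ConSurf-DB's binning).
    """
    columns = list(zip(*msa))
    entropies = [column_entropy(col) for col in columns]
    lo, hi = min(entropies), max(entropies)
    if hi == lo:
        # No variation in variability: give every column the middle grade.
        return [(n_grades + 1) // 2] * len(columns)
    grades = []
    for h in entropies:
        frac = (hi - h) / (hi - lo)  # 1.0 for the lowest-entropy columns
        grades.append(1 + min(n_grades - 1, int(frac * n_grades)))
    return grades

# Hypothetical toy alignment of four sequences, five positions each.
toy_msa = ["MKVLA", "MKILA", "MRVLG", "MKALS"]
grades = conservation_grades(toy_msa)
```

In this toy alignment the invariant first and fourth columns receive grade 9, while the most variable columns receive grade 1. Because the bins are defined by the range of scores within one alignment, the grades are relative rather than absolute.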

ConSurf-DB's analysis uses sophisticated, published, peer-reviewed, state-of-the-art methods. A more detailed overview of the process employed by ConSurf-DB is available. Proteopedia's built-in display of ConSurf-DB results is a good place to start looking for conserved patches.

However, ConSurf-DB usually does not show all the conserved patches present in proteins with the same function (see ConSurf-DB Often Obscures Some Functional Sites, below). Therefore, you may wish to extend your analysis of conservation by limiting it to proteins of one function, using the ConSurf Server. The results of such an analysis can be displayed in a molecular scene in Proteopedia. See Help:How to Insert a ConSurf Result Into a Proteopedia Green Link.

Locating Variable Patches
In some cases, patches of highly variable (rapidly mutating) residues are also functional sites. These can also be identified preliminarily with Proteopedia's Evolutionary Conservation scenes from ConSurf-DB, and more definitively with conservation analysis limited to proteins of a single function. For example, mutations in influenza hemagglutinin help the virus to evade host defenses (see 1hgf). Another example is the high allelic variability of the peptide-binding groove of Major Histocompatibility Complex Class I. That variability helps the grooves of the alleles within any individual to bind a wide range of peptides, hence enabling the T lymphocyte system to defend against a wide range of pathogens, including influenza virus.

Conservation for Domain Folding
Certain residues on the surfaces of protein molecules tend to be conserved in order to maintain proper folding, rather than because they are part of a site functioning to interact with substrate, ligand, or a protein partner. Secondary structure elements need to break at the protein molecular surface in order to turn back into the folded protein domain. Therefore, it is common to see isolated highly conserved residues that enable turns, or break helices, notably glycines or prolines, on protein structure surfaces.

Remember that you can touch any residue with the mouse in the Evolutionary Conservation scene in Proteopedia (in Jmol), and its identity will be displayed after a few seconds. This works best with spinning turned off.

Every structure in Proteopedia has a link to be displayed in FirstGlance in Jmol. There, you can use the Find dialog to enter the name of an amino acid, e.g. glycine or proline, and the positions of all of the specified amino acids will be highlighted. You can then visualize their distribution in the 3D structure. This strategy can also be utilized when viewing the protein colored by conservation, using the FirstGlance links in either ConSurf server.

ConSurf-DB Often Obscures Some Functional Sites
Proteopedia's Evolutionary Conservation scenes use pre-calculated results from ConSurf-DB. ConSurf-DB is designed to include a wide range of sequences in its multiple sequence alignments (MSAs) and analyses. Often, the MSA will include a substantial number of sequences for proteins with functions different from that of the query protein. (See these instructions for how to find out the functions of the proteins used in ConSurf-DB's MSA.) Consequently, amino acids that are colored as highly conserved by ConSurf-DB are truly highly conserved across a wide range of sequence-similar proteins. However, amino acids that are highly conserved in proteins with the same function as the query protein may not appear conserved in ConSurf-DB results. A good way to find these obscured functional sites is to do a conservation analysis that is limited to proteins of a single function. See Limiting ConSurf Analysis to Proteins of a Single Function.
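The masking effect described above can be illustrated with a small sketch. The alignment, the function labels ("kinase", "phosphatase"), and the helper name are hypothetical, and "conserved" is simplified here to mean that a column is identical in every sequence, which is far cruder than what ConSurf computes.

```python
def fully_conserved_columns(sequences):
    """Return 0-based indices of columns where all aligned sequences share one residue."""
    return [i for i, col in enumerate(zip(*sequences)) if len(set(col)) == 1]

# Hypothetical toy alignment: (function annotation, aligned sequence).
annotated_msa = [
    ("kinase", "GDSHK"),
    ("kinase", "GDSHR"),
    ("kinase", "GDSHK"),
    ("phosphatase", "GDTNK"),
    ("phosphatase", "GDAQR"),
]

all_seqs = [seq for _, seq in annotated_msa]
kinase_seqs = [seq for func, seq in annotated_msa if func == "kinase"]

broad = fully_conserved_columns(all_seqs)       # conserved across every function
limited = fully_conserved_columns(kinase_seqs)  # conserved within one function
revealed = [i for i in limited if i not in broad]
```

Here the third and fourth columns (indices 2 and 3) are invariant among the sequences labeled "kinase" but vary once the other sequences are included, so they appear conserved only in the function-limited analysis.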

Use Caution When Comparing Conservation of Sequence-Different Chains
This caveat applies only to molecules that contain chains with different sequences. The conservation colors shown in Proteopedia's Evolutionary Conservation scenes do not indicate the same levels of conservation for chains of different sequences. This is because ConSurf-DB calculates conservation levels independently for each sequence-different chain, and the levels are relative to the multiple sequence alignment constructed for each sequence-different chain.

For example, consider 1bqh, which contains 10 chains, representing two copies of a 5-chain molecule. Each molecule contains four sequence-different chains. A visit to ConSurf-DB reveals, as expected, that a different number of sequences was utilized for the multiple sequence alignment (MSA) and conservation calculations for each of these sequence-different chains, and that each MSA had a different average pairwise difference (APD), a measure of diversity within the MSA. Therefore, residues with, for example, conservation level 9 (maximal conservation) in each of the three ConSurf-DB-colored sequence-different chains have the highest levels of conservation within their own chain, but do not have exactly the same absolute levels of conservation.
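To make the notion of MSA diversity concrete, here is a minimal sketch of one plausible way to compute an average pairwise difference. It assumes APD means the fraction of mismatched aligned positions, averaged over all pairs of sequences; ConSurf-DB's exact definition may differ (for example, it may be based on evolutionary distances). The function names and both toy alignments are hypothetical.

```python
from itertools import combinations

def pairwise_difference(a, b):
    """Fraction of aligned positions (gap-free in both sequences) where a and b differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)

def average_pairwise_difference(msa):
    """Mean of pairwise_difference over all unordered pairs of sequences in the MSA."""
    if len(msa) < 2:
        raise ValueError("need at least two sequences")
    diffs = [pairwise_difference(a, b) for a, b in combinations(msa, 2)]
    return sum(diffs) / len(diffs)

# Two hypothetical alignments standing in for two sequence-different chains.
msa_chain_a = ["MKVLA", "MKVLA", "MKVLG"]  # closely related sequences
msa_chain_b = ["MKVLA", "MRILS", "QKALG"]  # much more diverse sequences

apd_a = average_pairwise_difference(msa_chain_a)
apd_b = average_pairwise_difference(msa_chain_b)
```

The second alignment is far more diverse than the first. A column that is invariant in the diverse alignment has survived more sequence divergence than an invariant column in the nearly uniform one, which is why the same conservation level in two chains need not mean the same absolute conservation.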

In Proteopedia's Evolutionary Conservation scenes, all the chains in the molecule are colored in the same scene. This gives a potentially useful overview, but can be misleading unless one realizes that a given conservation color, in two sequence-different chains, does not mean exactly the same level of conservation. In contrast to Proteopedia's Evolutionary Conservation scenes, ConSurf-DB and ConSurf Server apply conservation level colors to only one chain sequence at a time, thereby avoiding this possible confusion.

Conservation Results Will Change With Time
Slight variations in the conservation pattern will occur over time, as the number of sequences in the sequence databases used by ConSurf-DB increases. Each update of ConSurf-DB uses somewhat larger sequence databases, and consequently, the MSAs for each chain will be slightly different. Also, the methods employed by ConSurf are improved periodically. For example, the MSA algorithm originally defaulted to CLUSTAL-W, then to MUSCLE, and later to MAFFT.

Consequently, results from the ConSurf Server will also change slightly with time, even when the job parameters are the same. Only if you upload the same MSA will the results be identical for a given chain when the jobs are run months or years apart.

You may find it useful to download ConSurf results (from either ConSurf server) in order to preserve a particular result for comparison with results obtained at later times.

INTREPID
"INTREPID is an information-theoretic approach for functional site identification that exploits the information in large diverse multiple sequence alignments. INTREPID gathers homologs for a sequence using PSI-BLAST and estimates a phylogenetic tree. It then uses Jensen-Shannon divergence to measure the information for each position in the sequence at each subtree node encountered on a traversal of the phylogeny, tracing a path from the root to the leaf corresponding to the sequence of interest. Positions that are conserved across the entire family receive stronger scores than those that only become conserved within more closely related subgroups. This tree traversal produces a phylogenomic conservation score for each position in the MSA. INTREPID uses information from sequence only, and can thus be used when knowledge of structure is not available." (Quoted from the INTREPID website.)
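The Jensen-Shannon divergence mentioned in the quotation can be sketched for a single alignment column as follows. This is only the divergence itself, not INTREPID's tree traversal or scoring. The uniform background distribution is an assumption made for simplicity (a real implementation would more likely use observed amino acid frequencies), and all names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed background: every amino acid equally likely.
BACKGROUND = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

def distribution(column):
    """Relative frequency of each amino acid in a column (gaps and unknowns ignored)."""
    residues = [r for r in column if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("column contains no standard amino acids")
    counts = Counter(residues)
    return {aa: counts[aa] / len(residues) for aa in AMINO_ACIDS}

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p = 0 contribute 0."""
    return sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in AMINO_ACIDS if p[aa] > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded between 0 and 1."""
    m = {aa: 0.5 * (p[aa] + q[aa]) for aa in AMINO_ACIDS}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

conserved_score = js_divergence(distribution("GGGGGGGG"), BACKGROUND)
variable_score = js_divergence(distribution("GASTVLIK"), BACKGROUND)
```

A column occupied by a single amino acid diverges strongly from the background and so scores higher than a column with many different residues.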

INTREPID accepts a protein chain sequence as input. It offers to color conserved residues on 3D protein structures in Jmol. The 3D structures are obtained (when available) from the Protein Data Bank by sequence alignment searching, and users may choose from a menu of hits.

Evidence is provided that INTREPID out-performs ConSurf for predicting catalytic residues.

Unlike ConSurf, INTREPID does not identify the most variable residues in addition to the most conserved.

siteFiNDER|3D
siteFiNDER|3D performs conserved functional group (CFG) analysis. "CFG Analysis is a general method for predicting the location of functionally important sites within a target protein structure. Like other available structure/sequence analysis techniques, CFG Analysis exploits the evolutionary relationships present across groups of homologous proteins to identify regions that are likely to be of functional significance. However, this technique is particularly useful for situations where other methods fail, for instance when only a few or highly similar homologues can be identified." As its name implies, CFG analysis attempts to identify groups of conserved amino acids that together represent a functional site. In this respect, it goes beyond most other evolutionary conservation servers, which stop at assigning a conservation value to each amino acid. See the comparison of siteFiNDER|3D with ConSurf for cytochrome c.

This site provides links to several other software packages that predict functional sites, some of which are not further discussed in the present article.

HotPatch
HotPatch "finds unusual patches on the surface of proteins, and computes just how unusual they are (patch rareness), and how likely each patch is to be of functional importance (functional confidence (FC).) The statistical analysis is done by comparing your protein's surface against the surfaces of a large set of proteins whose functional sites are known." One advantage of HotPatch is that sequence homologs are not required. See the comparison of HotPatch with ConSurf for cytochrome c.

Evolutionary Trace Viewer
See the comparison of Evolutionary Trace Viewer (ETV) with ConSurf for cytochrome c.

Comment by User:Eric Martz, March, 2009: From the information provided on the ETV website, I found it quite difficult to understand what the ETV is doing, or how to use the viewer. An explanation in simple terms for non-specialists would be very useful.